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Abstract 

A major conclusion from comparative genomics is that many sequences that do not code for proteins are conserved beyond 
neutral expectations, indicating that they evolve under the influence of purifying selection and are likely to have functional 
roles. Due to the degeneracy of the genetic code, synonymous sites within protein-coding genes have previously been seen 
as "silent" with respect to function and thereby invisible to selection. However, there are indications that synonymous sites of 
vertebrate genomes are also subject to selection and this is not necessarily because of potential codon bias. We used 
divergence in ancestral repeats as a neutral reference to estimate the constraint on 4-fold degenerate sites of avian genes in 
a whole-genome approach. In the pairwise comparison of chicken and zebra finch, constraint was estimated at 24-32%. 
Based on three-species alignments of chicken, turkey, and zebra finch, lineage-specific estimates of constraint were 43%, 
29%, and 24%, respectively. The finding of significant constraint at 4-fold degenerate sites from data on interspecific 
divergence was replicated in an analysis of intraspecific diversity in the chicken genome. These observations corroborate 
recent data from mammalian genomes and call for a reappraisal of the use of synonymous substitution rates as neutral 
standards in molecular evolutionary analysis, for example, in the use of the well-known d N /6 s ratio and in inferences on 
positive selection. We show by simulations that the rate of false positives in the detection of positively selected genes and 
sites increases several-fold at the levels of constraint at 4-fold degenerate sites found in this study. 

Key words: chicken, turkey, zebra finch, 4-fold degenerate sites, purifying selection, nearly neutral theory, comparative 
genomics. 



Introduction 

Before detailed studies of genetic variation at the DNA and 
protein levels were possible, a common view held that most 
mutations in the genetic material have an effect on fitness 
(Dobzhansky 1970). As a consequence, they were thought 
to either relatively quickly reach fixation by positive selection 
or become removed from the population by negative selec- 
tion. This view was challenged in the 1 960s by the observa- 
tion of significant within-species polymorphism (Hopkinson 
et al. 1 963; Spencer et al. 1 964; Lewontin and Hubby 1 966), 
indicating that some of the variation in the genome might 
be more or less neutral with respect to fitness. Sparked in 
part by such data and armed with a mathematical approach 
based on diffusion equations to derive theoretical argu- 
ments, this soon led Kimura to develop the neutral theory 
of molecular evolution (Kimura 1 968), a model positing that 



genetic drift of neutral alleles is an important driving force in 
evolution. In parallel, it became increasingly clear that a large 
fraction of the genome appears nonfunctional and is 
thereby potentially shielded from selection. DNA was found 
to consist of much else than genes, which were found to 
consist of exons and introns, and cracking the genetic code 
revealed that some positions within exons were "silent" with 
respect to which amino acid is encoded. 

The historical perspective briefly sketched out above is of 
relevance for the development of our current view on ge- 
nome composition and molecular evolution. To some ex- 
tent, the shift in focus of the 1960s and 1970s, from 
natural selection being thought to have a prevailing role 
to acknowledging that neutral processes affect parts of 
the genome, is corroborated and substantiated by recent 
genomic data. This is particularly so, given Ohta's extension 
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of Kimura's model to the nearly neutral theory of molecular 
evolution (Ohta 1973) and the many clear examples that 
slightly deleterious mutations will effectively behave as 
neutral at low effective population sizes (Wright and 
Andolfatto 2008; Ellegren 2009). However, in parallel, there 
has also been an opposite trend. One of the most important 
conclusions from comparative genomics is that many 
regions of the genome previously considered nonfunctional 
show evidence of sequence conservation beyond neutral 
expectations (reviewed by, e.g., Dermitzakis et al. 2005). 
This suggests that several other sequence categories than 
those directly coding for proteins are functional and subject 
to selection. The identification of the numerous ultra- 
conserved elements found in intergenic regions previously 
seen as genomic "deserts" in vertebrate genomes (Bejerano 
et al. 2004; Katzman et al. 2007) are an example of a tran- 
sition in the perception of the genome and how it evolves. 

Due to the degeneracy of the genetic code, synonymous 
substitutions are candidates to represent neutral changes. 
Accordingly, 4-fold degenerate sites have traditionally been 
seen as essentially free of selective constraint (Eyre-Walker 
and Keightley 1999; Nachman and Crowell 2000), at least 
in mammals where effective population sizes are often low 
and where mutations with a small effect on fitness should 
be expected to behave as neutral. The 4-fold degenerate sites 
have therefore been used as a neutral reference both in stud- 
ies of constraint at nonsynonymous sites and in noncoding 
sequences. However, there is evidence that at least some 
silent sites are constrained also in mammals (Chamary 
et al. 2006). Importantly, two recent genome-wide studies 
have reported significant levels of constraint at 4-fold 
degenerate sites of mammalian genes. Eory et al. (2010) 
obtained estimates of 22-27% in hominids and 11-12% 
in murids. Pollard et al. (2010) used alignments of 36 
mammalian genomes to estimate a mean constraint in the 
mammalian phylogeny of 25%. In both these studies, diver- 
gence at ancestral repeats (ARs) was used as a neutral refer- 
ence against which divergence at synonymous sites was 
contrasted. To widen the knowledge of selection at synony- 
mous sites in vertebrates, we here estimate constraint at 
4-fold degenerate sites of avian genes. We use data from 
three available bird genomes (chicken, turkey, and zebra 
finch) and find high levels of constraint: 24-43%. Selection 
on synonymous sites thus seems to be a ubiquitous feature of 
vertebrate genomes. 

Material and Methods 

Sequence Data and Divergence Estimates 

Protein-coding sequences for 1:1 orthologs of chicken and 
zebra finch and for 1:1:1 orthologous of chicken, turkey, 
and zebra finch were downloaded from BioMart (Ensembl 
61) (http://biomart.org) and, in both cases, aligned using 
Mafft 6.716b (Katoh et al. 2009). Pairwise chicken-zebra 



finch alignments of intronic sequences and three species 
EPO alignments of intronic sequences, including information 
about intron coordinates, were downloaded from Ensembl 
(Version 61). Due to alternative splicing, some introns are 
(partly) annotated as exonic sequences and intronic sites 
annotated as exons were removed. Repeats were masked 
using RepeatMasker 3.2.9 (Smit et al. 201 0), and all alignments 
were then cleaned using Gblocks 0.91 b (Castresana 2000) with 
a minimal block length of 30 bases and a maximum number of 
eight contiguous nonconserved positions. Chicken annota- 
tions (Ensembl 61) were used to identify intronic transposable 
elements for the chicken-zebra finch comparison, and we de- 
fined AR as elements present in orthologous positions of 
chicken and zebra finch introns. For the three-species compar- 
ison of chicken, turkey, and zebra finch, we defined AR as el- 
ements present in orthologous positions of introns of all three 
species, again using chicken repeat annotations. 

We masked all CpG dinucleotides by excluding sites pre- 
ceded by cytosine (C) or followed by guanine (G) (CpG- 
prone sites, see Keightley and Gaffney 2003); all divergence 
estimates reported herein are thus non-CpG divergences. 
Divergences were estimated using a general time-reversible 
model with a gamma distribution of variable rate among 
sites (REV model of baseml from the PAML package version 
4.4b; Yang 2007). For two-species alignments, we esti- 
mated divergence in the different sequence categories (0- 
fold and 4-fold degenerate sites, ARs, and introns) for each 
gene separately and then obtained the genome-wide mean 
based on these estimates. For the three-species alignments 
where fewer genes were available, we concatenated all 
gene sequences and obtained a genome-wide divergence 
estimate from the concatenated data. However, in order 
to be able to test for a correlation between divergence 
and gene expression parameters in chicken, we also esti- 
mated divergence at 4-fold degenerate sites for each gene 
separately in the three-species alignments. To reduce the risk 
of incorrectly inferred orthology and to avoid saturation 
problems, genes with a total divergence d > 1 .8 were ex- 
cluded in the pairwise comparison, and genes exceeding 
a divergence of 0.9 in at least one branch were excluded 
from further analysis in the three-species comparison. 

Pairwise estimates of 6 N and d 5 for each gene between 
chicken and zebra finch were taken from Nam et al. (201 0). 

Chicken polymorphisms derived from genome rese- 
quencing of pools of unrelated individuals were obtained 
from (Rubin et al. 2010). In the absence of available allele 
frequency estimates for these data, we used the density 
of single nucleotide polymorphisms (SNPs; number of SNPs 
per base pair) as a measure of polymorphism level. 

Estimates of Selective Constraint 

Selective constraint was estimated using an approach intro- 
duced by Kondrashov and Crow (1993) and extended by 
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Eyre-Walker and Keightley (1999). For the pairwise align- 
ments with divergence estimates obtained for individual 
genes, we used the formula: 

c=1 _ £/=iQ> 

where N is the number of genes analyzed, o, is the observed 
divergence at tested sequence category for gene / (in most 
cases 4-fold degenerate sites) and e,- is the expected diver- 
gence obtained from the divergence estimated for AR of 
gene /. For three-species alignments, we used the ge- 
nome-wide divergence estimates obtained from concate- 
nated sequences to directly estimate constraint as 1 - o/e. 
Note that we only included genes for which data on both di- 
vergence at 4-fold degenerate sites and at one or more in- 
tronic AR were available. This selection was applied to 
exclude the possibility of differences in rate and pattern of 
nucleotide substitution between regions with and without in- 
tronic AR affecting the estimates of constraint. Weighted es- 
timates of constraint were obtained by a method similar to 
the approach of Halligan and Keightley (2006). For estimating 
the weighting factor, we divided the number of alignable 
non-CpG AR sites by the number of all non-CpG AR sites. 
Weighted constraints were estimated dividing the original 
constraint estimates by the weighting factor. 

Simulations of Tests for Positive Selection 

To evaluate the impact of constraint on tests for positive se- 
lection, we simulated sequences using a branch-site model of 
evolution using Evolver from the PAML package (Yang 2007). 
The data were generated using the tree shown in Supplemen- 
tary figure 1 and with two types of selection schemes. In the 
first, we simulated two classes of sites, one evolving under 
w1 = 0.2 in background branches (black lineages in Supple- 
mentary figure 1) and w2 = 1 .0 in foreground branch (gray 
lineage) and the other with a constant w = 0.2 in all the lin- 
eages. In the second scheme, the first class had w1 =0.1 
in background branches and w2 = 2.0 in the foreground 
branch, whereas the second class had a constant w = 0.1 
in all the lineages. In both schemes, we allocated 20% of 
the sites to the first class and 80% to the second. 

To mimic constraint acting on synonymous sites, we 
ignored a fraction (0.25, 0.35, or 0.45) of the synonymous 
substitutions simulated. This was done by comparing the 
simulated sequences with the ancestral sequences — provided 
by Evolver — generated during simulation. Each time 
a synonymous substitution was identified, it was replaced 
by the ancestral state with a probability equal to the constraint. 
In the case of two substitutions per codon, we considered all 
the mutational paths between the two codons, except those 
leading to a stop codon. For simplification, we considered only 
one substitution randomly in case of three substitutions per 
codon and did not consider multiple substitutions at the same 



site. The process was repeated for internal lineages in order to 
apply the same constraint throughout the phylogenetic tree. 
For each simulation scheme, we simulated 200 data sets of 
1,000 codons with and without constraint acting on synony- 
mous sites and applied the branch-site likelihood test of pos- 
itive selection (Zhang et al. 2005). We repeated simulations 
with three different codon frequency spectra obtained for 
chicken-turkey-zebra finch orthologous alignments chosen 
to represent GC-poor, GC-average, and GC-rich genes. 

Gene Expression and Codon Usage 

Median-subtracted arcsinh expression data from 20 differ- 
ent chicken tissues were taken from Chan et al. (2009). 
Gene-wise expression breadth (x) was estimated following 
Yanai et al. (2005): 

EN A _ 

= / = 1 * max 

T N - 1 

where N is the number of tissues, x„ the expression intensity 
for gene x in tissue /, and x max , the maximum expression 
intensity of gene x across all tissues. For analyses of gene 
expression level, we used x max . As a measure for codon bias, 
we calculated the Codon Adaptation Index (CAI; Sharp and 
Li 1 987) obtained for chicken protein-coding sequences us- 
ing CODONW (http://codonw.sourceforge.net). 

Statistical Analyses 

Statistical analyses were performed using R version 2.13.1 (R 
Development Core Team 2008). If not stated otherwise, 
nonparametric bootstrapping (1,000 iterations on concate- 
nated sequences) was used to estimate confidence intervals 
for divergence and level of selective constraint. 

Results 

Estimates of Constraint at 4-Fold Degenerate Sites in the 
Chicken-Zebra Finch Comparison 

We identified 13,245 chicken-zebra finch 1:1 orthologs and 
among these 3,772 genes had at least one intronic AR 
element. After alignment, filtering, and removal of CpG- 
prone sites, the data set could broadly be defined as 
composed of 2.97 million 0-fold degenerate sites, 0.34 mil- 
lion 4-fold degenerate sites, 7.94 million bp of intronic AR, 
and 58.19 million bp of non-AR intron sequence. Divergence 
in these three categories of sequence classes was lowest 
among 0-fold degenerate sites and highest in AR (fig. 1). 
Divergence at 4-fold degenerate sites and at 0-fold sites 
was significantly different from AR divergence (Mann-Whit- 
ney U test, P < 0.001 for both comparisons). If we use AR as 
a neutral reference to estimate constraint in the other se- 
quence categories, 0-fold degenerate sites show an 86.7% 
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Fig. 1. — Estimated sequence divergence of ARs, introns, 4-fold 
degenerated sites, and 0-fold sites in the chicken-zebra finch 
comparison estimated gene by gene. ***Denotes significantly lower 
divergence in comparison to ARs {P < 0.001). 

(±0.6%) and 4-fold degenerate sites a 24.2% (±1 .6%) con- 
straint. Intronic divergence was similar to the AR divergence. 

Compared with the noncoding regions, the high degree 
of sequence conservation within the coding sequence and 
the ability to project DNA sequences onto alignments of pro- 
tein sequences greatly reduce the frequency of gaps and the 
need for filtering of those regions difficult to align. However, 
filtering of regions that are difficult to align within a presum- 
ably neutral sequence may lead to the divergence being 
underestimated because the rapidly evolving sites can be 
excluded from the analysis. One way to handle this potential 
problem is to weight estimated constraint by the proportion 
of sequence removed in the neutral reference. Using a pro- 
cedure similar to Halligan and Keightley (2006) (for details, 
see Material and Methods), we obtained a weighted 
estimate of constraint of 31.5% (±2.0%) for the 4-fold 
degenerate sites. 

Previous work in several different organisms (Lercher 
et al. 2004; Webster et al. 2004), including birds (Axelsson 
et al. 2005), has demonstrated regional consistency in 
nucleotide substitution rates, suggested to reflect regional 
mutation rate variation. This can also be seen in our data 
with a significant correlation in divergence between the 
neighboring genes, both for divergence at 4-fold de- 
generate sites (Breusch-Godfrey serial correlation LM test, 
u t = 26.8599, degree of freedom [df] = 1, P < 0.001) and 
for divergence in ARs (u t = 5.9362, df = 1, P = 0.015). 
However, we found no evidence for the clustering of genes 
in the genome with similar levels of constraint at 4-fold de- 
generate sites (u t = 0.193, df = 1, P = 0.66) nor did the 
constraint at 0-fold sites correlate with the genomic location 
(u t = 0.0148, df = 1, P = 0.90). Thus, selection does 
not display regional variation over the long evolutionary 
time scale analyzed (whereas such effects are expected over 
short time scales, due to the background selection and 
selective sweeps). 



Table 1 

Lineage-Specific Divergence (Mean ± Standard Deviation) of Different 
Sequences Categories Estimated from Concatenated Three-Species 
Alignments of Chicken, Turkey, and Zebra Finch 





4-Fold Sites 


AR 


Chicken 


0.028 (±0.001) 


0.049 (±0.001) 


Turkey 


0.038 (±0.001) 


0.054 (±0.001) 


Zebra finch 


0.302 (±0.005) 


0.399 (±0.004) 



Note. — The lineages are from an unrooted tree of the three species. 



Lineage-Specific Estimates of Constraint in the Chicken- 
Turkey-Zebra Finch Comparison 

Using three-species alignments of available avian genomes 
(chicken, turkey, and zebra finch) allows for lineage-specific 
estimates of divergence and also for polarizing substitutions 
onto lineages by parsimony principles. In addition, the evo- 
lutionary distance between the two galliforms, chicken and 
turkey, is considerably shorter than that between chicken 
and zebra finch, which should facilitate alignment and make 
divergence estimates more accurate. We identified 9,531 
1:1:1 orthologs among the three species, and for 1,667 
of these, there was at least one intronic AR present. 

Table 1 reports the estimated divergences of AR and 4- 
fold degenerate sites in the chicken, turkey, and zebra finch 
lineages. Note that in the unrooted tree of the three species, 
the lineage from the split between chicken and turkey to 
zebra finch ("the zebra finch lineage") includes the basal 
galliform branch, the short Galloanserae and Neoaves inter- 
nal branches, the basal passeriform branch and the terminal 
branch leading to zebra finch (Supplementary figure 2). The 
trends were similar to those obtained for the chicken-zebra 
finch comparison with AR evolving more rapidly than 4-fold 
degenerate sites (table 1). The estimated constraint for 
4-fold degenerate sites was 43.1% (±1.6%), 28.8% 
(±1.8%), and 24.2% (±1.4%) in the chicken, turkey, and 
zebra finch lineages, respectively. Weighting the estimated 
constraints to take regions difficult to align in AR into 
account increases the estimates to 49.9% (±1.9%), 
33.4% (±2.1%), and 28.0% (±1.6%), respectively. These 
estimates are generally higher than what we obtained in 
the more distant chicken — zebra finch pairwise comparison. 

Chicken Polymorphisms 

An analysis of within-species polymorphism essentially 
circumvents potential problems related to the varying 
confidence by which different sequence categories can 
be aligned in distant evolutionary comparisons. It also more 
directly pinpoints the selective constraints of sequence evo- 
lution currently in place. Based on genomic resequencing of 
pools of chicken population samples, Rubin et al. (2010) 
gathered genome-wide data on the location of 7 million 
of SNPs. We calculated the mean density of SNPs (i.e., 
the number of segregating sites divided by the length of 
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Fig. 2. — Gene-by-gene differences between divergence estimates 
of AR and synonymous sites. The dashed horizontal line marks where 
estimates of d s and d AR are equal. Values below the line are genes 
where divergence at AR is estimated higher than divergence at 
synonymous sites (and vice versa for values above the line). The red 
line denotes the lowess curve. 

sequence analyzed) in different sequence categories and 
found that 0-fold degenerate sites had the lowest incidence 
(0.0039 ± 0.0001), followed by 4-fold degenerate sites 
(0.0058 ± 0.0001) and AR (0.0080 ± 0.0002). The density 
of SNPs in AR was significantly higher than in the other 
categories (two-sided Mann-Whitney U test, P < 0.001 
in all cases). The level of polymorphism at 4-fold degenerate 
sites in chicken was 28% lower than in AR. 

Effect of Constraint at 4-Fold Degenerate Sites on 6 N /6 S 
and Tests for Positive Selection 

Constraint at 4-fold degenerate sites has implications for the 
interpretation of selection from the d/v/d 5 ratio and for tests 
of positive selection. To illustrate this, we compared esti- 
mates of d/v/d 5 and d/v/d AR for the 3,772 genes in the 
chicken-zebra finch comparison (d AR being divergence at 
ARs). On average, d N /6 5 estimates (mean 0.133 ± 0.142) 
were about 20% higher than d/v/d AR estimates (0.109 ± 
0.157; two-tailed Mann-Whitney U test, P < 0.001). At 
the level of individual genes, d AR was higher than d 5 for 
72% of the genes. As expected from the relatively low num- 
ber of available sites per gene and the associated stochastic 
influence on divergence estimates, the difference between 
d AR and d 5 was largest for genes with high estimates of d AR 
(fig. 2). 

It is conceivable that selective constraint at 4-fold degen- 
erate sites can affect inferences on positive selection. Specif- 
ically, if 4-fold degenerate sites are constrained, the 
proportion of nonsynonymous sites interpreted to evolve 
more rapidly than the presumed neutral reference should 
be elevated, potentially leading to an increased rate of false 
positives in the detection of positively selected sites and 



genes. To investigate this, we simulated sequence evolution 
under different levels of constraint at 4-fold degenerate sites 
and applied a standard maximum likelihood test of positive 
selection (PAML, branch-site model). The constraint levels 
(25%, 35%, and 45%) were chosen to reflect the range 
of estimates obtained in this study. For simulations without 
positive selection (simulations 1-3 in table 2), the proportion 
of genes detected as positively selected (i.e., false positives) 
increased from 3-9% without constraint to 14-33% with 
constraint. Moreover, the frequency of positively selected sites 
increased from 0.4-0.6% to 0.6-7.9%. Simulations letting 
a fraction of sites evolve under positive selection (see Material 
and Methods) found similar effects. For example, at a level of 
35% constraint at 4-fold degenerate sites, the proportion of 
genes detected as positively selected increased from 47-54% 
to 76-90% and the frequency of positively selected sites 
increased from 4.2-7.4% to 8.1-8.8%. 

As expected, the rate of false positives increased with in- 
creasing constraint. In simulations without positive selec- 
tion, the proportion of genes identified as positively 
selected was on average 2.2 times higher at 25% constraint 
on 4-fold degenerate sites than when such sites were un- 
constrained. For levels of 35% and 45% constraint, this in- 
creased to 3.5 and 7.8 times higher, respectively. For 
identification of positive sites, the frequency was 1.0, 4.6, 
and 14.7 times higher at 25%, 35%, and 45% constraint, 
respectively. Under the more realistic scenario of some sites 
evolving under positive selection, the proportion of genes 
identified as positively selected was 1 .4, 1 .7, and 1 .9 times 
higher at 25%, 35%, and 45% constraint, respectively. The 
corresponding numbers for positively selected sites were 
1 .4, 1.5, and 1 .6 times the higher frequencies. 

Divergence and 4-Fold Degenerate Sites and Gene 
Expression 

In order to search for possible causative correlates of con- 
straint at 4-fold degenerate sites, we considered variables 
related to gene expression: expression level, expression 
breadth, and codon usage. Microarray gene expression data 
is available from chicken so we focused on divergence in 
the chicken lineage using the three-species data set. How- 
ever, divergence at 4-fold degenerate sites was not corre- 
lated to any of the variables tested (gene expression level: 
Pearson r = 0.0008, P = 0.98; breadth of gene expression 
(x): r = -0.0262, P = 0.48; CAI: r = 0.0210, P = 0.57). 

Discussion 

The main conclusion from this study is that 4-fold degener- 
ate sites of avian genes evolve under significant constraint, 
at least when constraint is estimated using divergence in ARs 
as a neutral reference. This conclusion is supported both by 
data on intraspecific polymorphism and interspecific diver- 
gence. In the relatively distant comparison of chicken and 
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Table 2 

Simulation Results for the Proportion of Significant Likelihood Ratio Tests (LRT) for Positive Selected Genes and for the Number of Positively Selected 
Sites with Constraint (Denoted by "Constr.") and without Constraint (Denoted by "No con.") 



Proportions of Significant LRT 



Mean Number of Positively Evolving Sites 



Positive 



25% 
Constraint 



35% 
Constraint 



45% 
Constraint 



25% 
Constraint 



35% 
Constraint 



45% 
Constraint 



Simulation 


GC-Content 


Simulated 


No Con. 


Constr. 


No Con. 


Constr. 


No Con. 


Constr. 


No Con. 


Constr. 


No Con. 


Constr. 


No Con. 


Constr. 


1 


Low 


No 


0.09 


0.14 


0.04 


0.16 


0.05 


0.33 


5.5 


6 


5 


33 


5 


79.5 


2 


Average 


No 


0.04 


0.14 


0.05 


0.17 


0.06 


0.3 


6 


6 


5 


25.5 


4.5 


74 


3 


High 


No 


0.08 


0.15 


0.09 


0.22 


0.03 


0.32 


6 


6 


6 


13 


6.5 


77 


4 


Low 


Yes 


0.53 


0.71 


0.47 


0.86 


0.51 


0.96 


56.5 


71 


59.8 


85.8 


53.3 


90.1 


5 


Average 


Yes 


0.49 


0.73 


0.49 


0.76 


0.45 


0.94 


74 


77 


74 


81 


74 


86 


6 


High 


Yes 


0.53 


0.74 


0.54 


0.9 


0.52 


0.96 


39.5 


70 


42.5 


88 


50.5 


96 



Note. — Results were obtained from simulating 200 data sets of 1,000 codons with and without constraint acting on synonymous sites and applied the branch-site likelihood test 
of positive selection as implemented in PAML. LRT, Likelihood Ratio Tests. 



zebra finch, which are estimated to have diverged 90 mya 
(van Tuinen and Hedges 2001), we obtained constraint es- 
timates of 24-32%. The split between chicken (super order 
Galloanserae) and zebra finch (Neoaves) lineages represents 
the most basal divergence within Neognathae (Supplemen- 
tary figure 2), a group that contains >99% of all extant bird 
species. The deep split of the lineages investigated could 
suggest that the estimated constraint constitutes a represen- 
tative average for birds. However, it should be noted that 
contemporary birds are classified in some 25 orders, most 
of which originated around or soon after the K/T boundary 
(i.e., there are probably only short internal branches, in par- 
ticular within Neoaves; Hackett et al. 2008; Pacheco et al. 
201 1). The lineages that we sampled thus only constitute 
a minor part of evolution among modern birds. 

When using three-species alignments of chicken, turkey, 
and zebra finch, we found constraint for 4-fold degenerate 
sites to be the highest in the chicken lineage (43%) followed 
by the turkey (29%) and the zebra finch lineages (24%). This 
rank order might be seen as surprising if one considers that 
selection against slightly deleterious mutations should be 
more efficient in large populations, that is, the level of con- 
straint should correlate positively with the effective population 
size, N e (Ohta 1973). Although passeriforms (zebra finch) are 
typically small and short-lived birds, galliforms (chicken and 
turkey) are large and long-lived, and it is clear that N e of nat- 
ural populations of passeriforms are typically larger than that 
of galliforms. As a consequence, selection should be more ef- 
ficient and constraint higher in the former than in the latter. 
However, as mentioned above, the zebra finch lineage in the 
unrooted tree of chicken, turkey, and zebra finch includes sev- 
eral internal branches, including basal Galliformes, so the 
large population size of passeriforms is unlikely to be represen- 
tative for the entire zebra finch lineage in the unrooted tree. 

Moreover, just as functional elements within noncoding 
DNA turn over during evolution (Smith et al. 2004; 
Siepel et al. 2005; Pheasant and Mattick 2007), with the 
consequence of the amount of shared functional sequence 



decreasing with increasing genetic distance (Meader et al. 
2010), it is conceivable that the functional importance of 
individual synonymous sites also changes. We may thus ex- 
pect estimates of selective constraint to be lower for more 
distant comparisons. These caveats suggest that the lower 
estimate of constraint at 4-fold degenerate sites in the zebra 
finch than in the chicken and turkey lineages should not be 
taken too far. Yet, it could be noted that Eory et al. (2010) 
found constraint at 4-fold degenerate sites to be lower in 
murids than in hominids, despite the much large effective 
population sizes of the former than of the latter. 

Although comparative genomic studies are powerful in 
detecting purifying selection from sequence conservation 
in particular regions or at particular sites, they cannot reveal 
the underlying functional role of these sequences. However, 
several selective processes may explain why synonymous 
sites are constrained (Chamary et al. 2006), including selec- 
tion for mRNA stability, translational efficiency, and splicing 
regulation (Rocha 2004; Chamary and Hurst 2005; Parmley 
and Hurst 2007; Drummond and Wilke 2008). Moreover, 
there are increasing number of examples where mutations 
at synonymous sites cause human disease, demonstrating 
the critical role of such sites (e.g., Brest et al. 2011). We 
failed to detect a significant correlation between the diver- 
gence at 4-fold degenerate sites and either breadth or level 
of gene expression. Moreover, we did not find a correlation 
between the divergence and codon usage. There is little ev- 
idence for codon usage bias in birds, and the codon adap- 
tation index was not correlated with gene expression level (r 
= 0.0462, P = 0.21), as would have been expected under 
the selection for preferred codons in highly expressed genes 
(Hershberg and Petrov 2008). As selection for codon usage 
is typically weak (Duret 2002), it may very well be that the 
relatively low effective population sizes (A/ e ) of birds means 
that NqS for codon usage is in the neutral range. 

Clearly, any inference of selective constraint in a particular 
sequence category is only relative to a presumed neutral ref- 
erence. Do ARs represent the "ideal" neutral reference? A 
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possible confounding factor is the homogenizing effects on 
sequence evolution of events of nonallelic gene conversion. 
Such events are known to occur among transposable 
elements although their incidence is highest among young 
repeat elements with high sequence similarity (Aleshin and 
Zhi 2010). It may thus not be an issue for ARs present in 
distantly related genomes, like those of chicken and zebra 
finch. Another factor would be positive or negative selection 
at individual elements that have attained functional roles, for 
example, because of exonization (Schwartz et al. 2009; Shen 
et al. 201 1). There is increasing awareness of the importance 
of Alu elements as gene regulators during human evolution, 
however, for birds there has been no such documentation. 
Indeed, avian genomes are low in repeat numbers (Hillier 
et al. 2004), and there is no prominent occurrence of short 
interspersed elements (SINE) (like mammalian Alu elements). 
A third aspect relates to the fact that transposable elements 
tend to be hypermethylated as a host defense mechanism 
against transcription and further spread of repeat elements 
(Yoder et al. 1 997). This will clearly affect the substitution rate 
at CpG sites due to the strong tendency for cytosines at CpG 
sites to be methylated and replaced by thymine upon 
spontaneous deamination (e.g., Holliday and Grigg 1993). 
Divergence at AR-CpG sites should thus be higher than at 
many other CpG sites even though they evolve neutrally. 
However, methylation status does not appear to affect 
the substitution rate at non-CpG sites (Mugal and Ellegren 
201 1 ), which is the rate we studied herein. Overall, conclusions 
reached by recent studies of mammalian genomes suggest 
that AR currently represent the most appropriate neutral ref- 
erence for molecular evolutionary analyses (Thomas et al. 
2003; Lunter et al. 2006; Eory et al. 201 0; Pollard et al. 201 0). 

The rationale for studying the strength of selection at 
nonsynonymous sites by taking the d N /d 5 ratio, rather than 
just d/v, is that scaling d N by d 5 will take variation in nonsy- 
nonymous divergence due to variation in the underlying mu- 
tation rate (supposedly manifested in d 5 ) into account (Hurst 
2002). However, a consequence of constrained evolution at 
4-fold degenerate sites is that the d N /d 5 metric will not be 
a proper measure of the rate of protein evolution. For exam- 
ple, the common inference of neutral evolution when d N /d 5 
= 1 will not be valid. This will clearly need further investi- 
gation, not least because the degree of constraint at synon- 
ymous sites may very well vary among genes and so also the 
effect on d N /d 5 . Related to this, it will be important to ad- 
dress the correlation in constraint between d N and d s . These 
two parameters are obviously correlated due to mutation 
rate variation (Wolf et al. 2009); however, the question is 
if purifying selection adds to the correlation. Moreover, var- 
iation in the level of constraint at 4-fold degenerate sites due 
to variation in N e or life history will have implications to at- 
tempts to evaluate the effects of the same parameters on 
the rate of protein evolution via d N /d s (cf. Wright and 
Andolfatto 2008; Ellegren 2009). 



One important consequence of constraint at synonymous 
sites is that it may impede on the identification of positively 
selected genes and sites. Based on simulations of sequence 
evolution and inferences of positive selection using PAML, 
we found a significant increase in the frequency of false pos- 
itives when constraint at synonymous sites was introduced. 
For example, at 35% constraint and at a GC-content close 
to the genomic average, the frequency of falsely identified 
positively selected genes, as well as falsely identified posi- 
tively selected sites, increased by a factor of 3-5 under a sce- 
nario of no positive selection. Under the same constraint and 
GC-content and with positive selection introduced, the in- 
cidence of false positives increased by a factor of 1.1-1.5. 
We therefore foresee the need for improved maximum like- 
lihood protocols for detection of positive selection that take 
into account deviations from neutrality in the sequence cat- 
egory used as a neutral reference. 

Divergence at synonymous sites is often used as a mea- 
sure of genetic distance between species. Another conse- 
quence of the observation that 4-fold degenerate sites 
evolve under constraint is that this divergence will not repre- 
sent an unbiased distance metric, and this is particularly so if 
the level of constraint varies among lineages. For example, in 
a previous study, we estimated genome-wide mean d 5 in the 
chicken-zebra finch comparison at 0.42, an estimate that 
would be recognized as similar to that typically seen among 
eutherian orders. If we assume that the level of constraint 
at 4-fold degenerate sites is 30%, then a more unbiased 
(neutrality-based) distance would be 0.42/0.7 = 0.60. 

Supplementary Material 

Supplementary figures 1 and 2 are available at Genome Biology 
and Evolution online (http://www.gbe.oxfordjournals.org/). 
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